Optical Character Recognition Engines Performance Comparison in Information Extraction

Abstract

Named Entity Recognition (NER) is often used to acquire important information from text documents as part of the Information Extraction (IE) process. However, the quality of the text affects the accuracy of the data obtained, especially for text acquired through an Optical Character Recognition (OCR) process, which has never reached 100% accuracy. This research examines which OCR engine gives the highest performance for IE using NER by comparing three engines (Foxit, PDF2GO, Tesseract) over 8,562 government human resources documents within six document categories, two structures, and four measurements. Several essential entities, such as name, employee ID, number, publishing date, rank, and family member's name, were extracted automatically from the documents. The processes were done in the Python programming language, with preprocessing tasks handled separately for Foxit and Tesseract. In summary, each engine has its benefits and drawbacks: Tesseract is better in extraction and conversion time but falls short in the number of entities acquired.
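The kind of comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: regular expressions stand in for a trained NER model, the entity patterns (an 18-digit employee ID, a DD-MM-YYYY date), the engine names, and the sample OCR strings are all hypothetical, and only two of the measurements mentioned (number of entities acquired and extraction time) are approximated.

```python
import re
import time

# Hypothetical patterns standing in for a trained NER model.
# The 18-digit ID and DD-MM-YYYY date formats are assumptions.
PATTERNS = {
    "employee_id": re.compile(r"\b\d{18}\b"),
    "publishing_date": re.compile(r"\b\d{2}-\d{2}-\d{4}\b"),
}


def extract_entities(text):
    """Return {entity_label: [matches]} for every pattern that matches."""
    found = {}
    for label, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[label] = matches
    return found


def compare_engines(outputs):
    """Map each engine name to its entity count and extraction time.

    `outputs` maps an engine name to the text that engine produced
    for the same source document.
    """
    results = {}
    for engine, text in outputs.items():
        start = time.perf_counter()
        entities = extract_entities(text)
        elapsed = time.perf_counter() - start
        results[engine] = {
            "entities_found": sum(len(v) for v in entities.values()),
            "extraction_seconds": elapsed,
        }
    return results


# Made-up OCR outputs for one document; engine_b misreads the last
# digit of the ID as the letter "O", so the ID is no longer recognized.
sample_outputs = {
    "engine_a": "ID 198501012010011001 issued 05-03-2019",
    "engine_b": "ID 19850101201001100O issued 05-03-2019",
}
stats = compare_engines(sample_outputs)
for engine, row in stats.items():
    print(engine, row["entities_found"])
```

A single misread character is enough to drop an entity, which is why OCR quality feeds directly into the number of entities acquired.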

Similar resources

Rapid Feature Extraction for Optical Character Recognition

Feature extraction is one of the fundamental problems of character recognition. The performance of a character recognition system depends on proper feature extraction and correct classifier selection. In this article, a rapid feature extraction method named Celled Projection (CP) is proposed, which computes the projection of each section formed by partitioning an image. The recognitio...

Simple and Effective Feature Extraction for Optical Character Recognition*

A new representation method for recognition of handwritten characters, called LLF (Local Line Fitting), is presented. The method, based on simple geometric operations, is very efficient and yields a relatively low-dimensional and distortion-invariant representation. An important feature of the approach is that no preprocessing of the input image is required. A black & white or gray-scale pixel r...

Optical Character Recognition Using 26-Point Feature Extraction and ANN

In this paper we present a system for English handwriting recognition based on 26-point feature extraction of the character. The work describes an off-line handwritten alphabetical character recognition system using a multilayer feed-forward neural network. First, a new method called 26-point feature extraction is introduced for extracting the features of the handwritten a...

Shape-Free Statistical Information in Optical Character Recognition

Scott Leishman, Master of Science, Graduate Department of Computer Science, University of Toronto, 2007. The fundamental task facing Optical Character Recognition (OCR) systems involves the conversion of input document images into corresponding sequences of symbolic character codes. Traditionally, this has been accomplished in a bot...

Optical Character Recognition

This paper describes two implementations of optical character recognition: one using the template matching method, and one using the feature extraction method followed by support vector machine classification. With proper image preprocessing, the text is segmented into isolated characters, and the correlations between a single character and a given set of templates are computed to find the similarities and then ident...

Journal

Journal title: International Journal of Advanced Computer Science and Applications

Year: 2021

ISSN: 2158-107X, 2156-5570

DOI: https://doi.org/10.14569/ijacsa.2021.0120814